1. INTRODUCTION

The excess of information available on the Internet makes searching for information a difficult task. To
produce recommendations that satisfy their users, recommender systems work not only with the users
but also with the semantics of the data from those users. Several approaches, evaluation data sets,
methods, and metrics have been published, so a study of this area is necessary. Since
1998, more than 3,000 publications about recommender systems have appeared [1]. Recommender systems (RS)
for specific items have become a useful and important topic in many publications. While the term was
coined in the early 90s, it became popular in 1997 with the important special issue on RS by Paul Resnick in
Communications of the ACM (Association for Computing Machinery), after which work on RS grew
continuously. RS was developed as an independent research field in the mid-1970s at Duke University [2].
This development drew on artificial intelligence (AI), information
extraction, and human-computer interaction. Given the immense amount of information, RS has become very
popular and plays an important role on websites such as Amazon, Netflix, YouTube, Facebook,
Foursquare, and TripAdvisor. Figure 1 depicts the publication years of the papers that we selected for this
literature review. RSs are primarily directed toward users who lack sufficient personal experience or
competence to evaluate the potentially overwhelming number of alternative items that a website, for example, may
offer [3].


Figure 1. Number of papers selected for our literature review, through the years (x-axis: year; y-axis: number of papers)


Following this trend, recommendations that are based on social network analysis (SNA) [4] are 
becoming very popular and worth considering for each field of social computing, even for those that did not 
consider the users’ aspects before. The following findings are elaborated in the next sections of this paper, as 
part of our contribution to this literature review: 

— Provenance or network trust, which refers to all kinds of information that depict, illustrate, and analyze
the process of production of a certain product, is evaluated as a very feasible measure that could increase
the authority of the information selected for a specific recommendation,

— Our motivation for this literature review is that existing research on using SNA metrics together with
provenance metrics to yield better recommendations treats each approach separately; hence, our
intention is to define and correlate SNA metrics and network provenance in order to develop and
enhance the baselines of certain RS,

— Furthermore, the most widely used approaches and the features of recommendation presentation 
techniques are evaluated among the reviewed papers, 

— Adequate probabilistic techniques among machine learning classification algorithms are identified,
together with the methods of evaluation used in the recommender systems community.

Section 2 elaborates the method of literature selection for this review and describes the related
scientific fields and types of publications. Section 3 covers the detailed review of the publications
and the research findings, based on the research questions that we raised before conducting
this literature review. Furthermore, it summarizes the main techniques and features found in the literature and
identifies the most feasible approaches, which are chosen to be implemented in the next phase. Section 4
concludes with the main remarks and findings from this literature review, together with a plan for expansion in
future work.


2. METHOD 

For our study, we used Google Scholar to identify relevant literature. Google Scholar is known as a
search engine well suited to novice researchers because it makes extensive searching easy. More material can be
found with it than with other databases, and one of the reasons may be that it returns results not only for
full-text articles available online. Its citation index has been used as well, which includes citations in books, book
chapters, government reports, conference proceedings, and magazines. Citation counts in Google
Scholar are higher than in other sources, and the number of citations is the highest-weighted factor in its
ranking algorithm [5]. Additionally, we gathered and reviewed literature from ACM articles, the DBLP
computer science bibliography, Ph.D. theses, and other relevant journals and conferences.

The criteria by which we collected and classified the papers chosen for this review are keywords and
abstract content. The main keywords that we used are: recommender systems, social network
analysis, provenance, trust, social networks, social media, semantic web, ontology. Based on their relevance to
our topic, we grouped the papers into three categories: i) papers related to provenance and
social network analysis, ii) papers related to recommender systems and provenance, and iii) papers related to
recommender systems and social network analysis.


2.1. Related scientific fields and types of publications 

The most relevant computer science digital libraries were used to collect papers for the literature
review, such as the ACM Digital Library, Springer Open, IEEE Xplore, Google Scholar, and Elsevier. In the
initial phase, 2,118 papers were selected based on titles and keywords. In the next phase, filtering on
abstract content related to our field left 1,305 papers, of which only 318 remained in our
collection after we removed duplicate papers, short papers and posters, and updated versions of older
papers. In the final phase, only the 70 papers relevant to our topics were selected for the literature review.
Figure 2 depicts the literature selection process.

Figure 2. Literature selection process
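
The arithmetic behind the selection phases can be checked with a short sketch. The paper counts are the ones reported in this section; the phase labels, the `retention_rates` helper, and the derived percentages are illustrative only and are not part of the original study.

```python
# Paper counts reported for each phase of the selection process.
# The phase labels are shorthand introduced for this sketch.
PHASES = [
    ("title and keyword screening", 2118),
    ("abstract filtering", 1305),
    ("duplicates, short papers, posters, and updated versions removed", 318),
    ("final relevance selection", 70),
]


def retention_rates(counts):
    """Share of papers kept at each step, relative to the previous step."""
    return [later / earlier for earlier, later in zip(counts, counts[1:])]


counts = [count for _, count in PHASES]
for (name, count), rate in zip(PHASES[1:], retention_rates(counts)):
    print(f"{name}: {count} papers ({rate:.1%} of the previous phase)")
print(f"overall: {counts[-1] / counts[0]:.1%} of the initial pool")
```

Under these counts, roughly 3% of the initially retrieved papers reach the final selection, with the largest relative reductions in the last two phases.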


Among the publications selected for review, the distribution by type is as follows: 41%
are conference papers, 35% are journal articles, 15% are book chapters, 2% are review
reports, and 7% are theses related to the topic, as shown in Table 1. Because our
research topic involves three groups of relations, Figure 3 shows the distribution by topic: 50% of the
papers chosen for the literature survey belong
to the field of provenance and social network analysis (SNA), 30% are from the field of provenance and
recommender systems, and 20% are related to SNA and recommender systems.


Table 1. Types of publications

Publication type    Percentage
Conference          41%
Journal             35%
Book chapters       15%
Report              2%
Thesis              7%

Figure 3. Paper’s topic distribution


3. PUBLICATIONS’ REVIEW AND RESEARCH FINDINGS 

The dimensions and parameters used to analyze the similarities and differences between approaches
are the techniques used, the data sets used for evaluation and performance measurement, and the methods and
metrics used for evaluation. Furthermore, we present the results of the literature review as answers to the research
questions that were raised during the proposal phase.

— RQ1: What are the proposed solutions for utilizing network provenance and social network analysis
(SNA) metrics in recommender systems?

As elaborated in the method section, the review of publications is oriented toward our
specific domains, in this case toward establishing provenance and information trust in social
media and networks [6]-[8]. One approach, used in [6], addresses the problem of missing information, which
can be traced by analyzing the features of social media and networks together with the context of the
information received, in order to gain trust. In another contribution, Gundecha et al. [7] used the profile
information and attributes available on users' social profiles to create a tool that collects
those attributes, always with respect to the context of the received information. The inclusion of mutual
friends from previous interactions is used in [8] to create a model based on
provenance and trust.

However, it is always hard to establish whether the information used for recommendation purposes is
adequate. This issue is elaborated in [9], where the authors, in order to detect the spread of fake information,
proposed an algorithm that evaluates the trustworthiness of information spread across social networks. In [10], the
objective was to build an ontology model that incorporates provenance retrospectively, with the intention
of providing information beyond the standards already in use. In [11], it is argued that harmful and
malicious data and information can propagate through the system, especially when the model is offline.

How trust is achieved in various application domains by describing the life cycle of
provenance itself is discussed in [12], where the authors, after analyzing advantages and
disadvantages, presented a taxonomy of the provenance schemas currently in use. Because provenance is
interrelated with other research fields, the authors of [13] gave a brief introduction to provenance solutions
and the ongoing research on them, including the development of databases with natural language interfaces
[14], [15] and frameworks built on queries [16], [17].

In another study [18], comic characters form a strip, each strip represents an activity, and
the overall visualization is created from recorded provenance graphs. The relationship between data citation
and provenance, distinguished by a model that represents views of the citation and their relation to query
constructs, is presented in [19]. Another paper presented an application that combines
provenance values with trust values from other observations to personalize website
content, creating a two-level model that includes provenance trust and observations of the system
[20]. The issue of provenance and data sources is addressed in [21], where the authors discussed and
presented a PROV model.

The most significant book related to the research topic of this literature review deals
with one of the most important issues in social networks: trust in and between users in a
network. As its authors describe, the existing social media infrastructure has no definite regulations that
establish the trust and provenance of data about users and entities. Searching for the provenance of
the data used therefore opens a whole new problem in this field, which requires the development and use of
metrics that provide sufficient trust in a given group of data. Additionally, Barbier et al. [22] defined the
provenance data attributes, together with ways to measure them.

A mechanism for classifying social signals by the identities of their sources, where many
of these classification tasks are treated as output signals, is presented in [23]. Herschel et al. [24] gave
an overview of social networks and data provenance by addressing three main questions, and
arrived at a classification for each of the classes. The classification includes four main provenance types, namely
provenance meta-data, information systems provenance, workflow provenance, and data provenance [25], as
depicted in Figure 4. Another important finding is the relation between the provenance model and
the level of instrumentation: the more specific the provenance model, the
higher the level of instrumentation.

Figure 4. Provenance hierarchy [24]


— RQ2: What classification methods can be utilized among trust models to satisfy recommender systems
(RS) that depend on the provenance and trust of the data in their datasets?

For this research question, we reviewed publications related to provenance and
recommender systems. One significant paper studied current trends in trust-aware recommender systems,
combined with an analysis of graph-based approaches [26], and emphasized that computing and using trust
is a common effort among scientists in many fields. Statistical techniques are elaborated in more detail and are
subdivided into probabilistic techniques (Bayesian systems, belief models, and Markov models) and machine
learning (artificial neural networks), as represented in Figure 5.

Figure 5. Classification of trust models and their categories [26]


In another contribution, the authors built a model based on socio-cognitive features, using fuzzy logic to
compute the trust inferred from the sources they believed in [27]. Selvaraj and Anand [28],
using proposals for changing trust in the authors of a document, proposed a tool that integrated a whole
ontology. In another approach, genetic algorithms combined with historical data in peer networks were
used to infer trust [29]. Friend-of-a-friend (FOAF) user profiles and their trust relations were used to build a
trusted network in the semantic web [30], while the authors of [31] used a method that ranks nodes in a network
based on local and group trust.

Akhtarzada et al. [32] argued for a specific method in which trust is calculated as a
weighted score that includes distrust, inconsistency, trust, and ignorance. However, it was argued
that including distrust makes the calculation very challenging. Following this reasoning, [33]
presents another approach, which instead uses five other interchangeable parameters besides
trust, which remains the common value. In that paper, entities are the items being recommended, and
a group of values is taken into consideration, forming a matrix with several
dimensions, so that a user can rate an item on more than one characteristic. In another
approach, the authors presented a recommendation engine that uses metrics for the social aspects of a
user [34] and then applies these data to a graph that represents those characteristics of the object.

Sinha and Swearingen [35] focused on using the relationships among users and information from users'
profiles to generate predictions, which proved feasible, since this logic is used today across
all social network platforms. In this context, a common understanding was pointed out: we tend to
believe and take suggestions from people we know more readily than we act on recommendations
from anonymous sources, as mentioned in [36]. To include both trust and recommender systems
in their solution, Andersen et al. [36] proposed a recommender system that generates
personalized recommendations from the input of a group of users on a specific
topic while also calculating trust, and called this an axiomatic style. Further, they mentioned that such
axiomatic methods have been used in systems with personalized ranking [37], [38]
and with global ranking [39]-[42].

In [43], a survey evaluated recommender systems that include trust in their
approaches. It analyzed techniques ranging from traditional collaborative filtering (CF) that uses only trust up to
techniques that include both similarity among users and trust between them. Based on empirical
results collected on different datasets, it was argued that CF has some fundamental
weaknesses in finding similarities among users. Another interesting approach was used by
He and Chu [44], who asked respondents to self-evaluate their work and rate its
quality. This gave them scores for the work, to which they added provenance metrics,
completing a metadata profile for all users.

In terms of the pure involvement of SNA in recommender systems, several interesting contributions
were analyzed for our literature review. In [45], a model based on a probabilistic calculation is
presented, and it is argued that, whatever recommendations come from social networks, people
tend to believe the opinion of their friends more than the output of a recommender system. Another
approach investigated whether a data warehouse or big data is better suited to recommender
systems, using two kinds of analysis: qualitative and quantitative [46]. A book chapter [47] presented an
approach that justifies the linkage between recommender systems and social network analysis, along with
its benefits.

The main techniques are analyzed and discussed in that chapter, and the importance of including SNA in
recommender systems is emphasized. Given the rise in usage, its authors propose the
development and use of more advanced algorithms in RS. In another approach, Sellami et al. [48]
proposed using SNA only as a tool to identify collaborations within groups and cliques, by means of size,
degree centrality, network centrality, and density. With these metrics and techniques, the authors intended to
identify collaborating nodes and groups, and they were also able to identify isolates within the network.
A semantic social recommendation system is presented in [49], where social network analysis
measures are used to evaluate the power of semantics in RS.
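
To make the SNA measures named above concrete, the following minimal sketch computes degree centrality, density, and isolates on a small undirected graph. The graph, the node names, and the function names are hypothetical and are not taken from [48]; the definitions are the standard normalized ones, which may differ from the exact variants used in the reviewed papers.

```python
def degree_centrality(adjacency):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(neighbors) / (n - 1) for node, neighbors in adjacency.items()}


def density(adjacency):
    """Fraction of all possible undirected edges that are actually present."""
    n = len(adjacency)
    if n < 2:
        return 0.0
    # Each undirected edge appears in two neighbor sets, so halve the total.
    edges = sum(len(neighbors) for neighbors in adjacency.values()) / 2
    return edges / (n * (n - 1) / 2)


def isolates(adjacency):
    """Nodes that have no connection to any other node."""
    return [node for node, neighbors in adjacency.items() if not neighbors]


# Hypothetical collaboration network: a, b, c form a clique, d is attached
# to c only, and e collaborates with nobody.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}

print(degree_centrality(graph))
print(density(graph))
print(isolates(graph))
```

In this toy network, node c has the highest degree centrality and node e is reported as an isolate, which mirrors the kind of collaborating groups and isolated nodes described above.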

As part of the trend of advancing RS overall, [50] presented a hybrid approach that combines social network
analysis with collaborative filtering: certain groups of users are selected, and clustering analysis is applied to
the information retrieved in order to obtain a conventional clustering algorithm. In
another approach [51], in a network built around films and media, the opinions of friends of friends in the
network are considered for the final output in order to achieve high satisfaction with a recommendation.
Further, a survey analyzed the methodologies of RS that use SNA along six dimensions: performance
measures, recommendation approaches, research domains, the data sets used in each domain,
data mining techniques, and recommendation type [52]. The link weights between users and between interactions
were analyzed in [53] to obtain user preferences and item recommendations; social influence was thereby
measured and characterized as macroscopic and microscopic. Frikha et al. [54] presented a
solution that avoids the cold-start problem by using users’ existing social network data to make
initial recommendations. A combination of approaches is justified in another contribution, whose authors
argued that people tend to trust recommendations made by people or friends they know rather than by
strangers [55]. Additionally, they presented an ontology related to recommendations that
represents the semantics of the items. The use of RS in other domains, such as economics, is presented in
[56], where the authors proposed a recommender system that considers mutual fund transactions
between people in order to recommend a certain investment on a potential stock exchange. The use of
SNA in a tourism recommender system was shown to be feasible by Ciceri et al. [57], who used
in-degree and authority degree centrality measures to rank points of interest for a tour. The
recommendation results were then evaluated by previous tourists, and the results were very promising, which
motivates the use of SNA metrics in RS.


3.1. Findings and correlation 

Considering the three fields of our review, Figure 6 summarizes the main techniques,
features, and approaches considered in the reviewed papers for each field. As shown in
Figure 6, we have highlighted (in green) content-based filtering techniques combined with term
frequency-inverse document frequency (TF-IDF) features as the most feasible approach when combined with
provenance. Because our focus is to find and recommend the most trusted items, we favor trust-aware
RS, in which trust, distrust, and ignorance are calculated as weights on the relationships between nodes in
a network. Furthermore, where machine learning techniques are involved, probabilistic techniques, more
specifically a Bayesian system/classifier, seem to us the most feasible
approach.

With these parameters from RS and provenance, and following the trends found in the literature,
we concluded that SNA centrality metrics might also improve the quality of recommendations.
By including exponential rank (which calculates the trust of nodes within the network) and inverse
closeness centrality, we will be able to identify the most trusted nodes within the network and thus
choose whose recommendations we trust more. Our main contribution, derived from
analyzing the results and findings of the papers reviewed in this report, will be the use of
the Naive Bayes classification technique on a certain dataset, which has proved to be a feasible approach
according to the preliminary results of an ongoing study in which our research group is involved.

Figure 6. Summary of the literature review's main techniques, features, and approaches (in green: the most
feasible, chosen by our research)
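
As an illustration of the Naive Bayes technique mentioned above, the following is a minimal sketch of a categorical Naive Bayes classifier with add-one (Laplace) smoothing. The two features (relationship to the recommender and traceability of provenance), the class labels, and the toy data are assumptions made for this example only; they are not the dataset or the feature set of the ongoing study.

```python
import math
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes for categorical features with add-one smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        # value_counts[label][feature_index][value] -> number of occurrences
        self.value_counts = defaultdict(lambda: defaultdict(Counter))
        # Distinct values seen per feature, needed for the smoothing term.
        self.feature_values = [set() for _ in rows[0]]
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[label][i][value] += 1
                self.feature_values[i].add(value)
        return self

    def log_scores(self, row):
        """Unnormalized log posterior of every class for one row."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            for i, value in enumerate(row):
                seen = self.value_counts[label][i][value]
                k = len(self.feature_values[i])
                score += math.log((seen + 1) / (count + k))  # smoothed likelihood
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Hypothetical training data: (relationship, provenance) -> trust label.
rows = [
    ("friend", "traceable"),
    ("friend", "traceable"),
    ("friend", "unknown"),
    ("stranger", "traceable"),
    ("stranger", "unknown"),
    ("stranger", "unknown"),
]
labels = ["trusted", "trusted", "trusted", "trusted", "untrusted", "untrusted"]

model = CategoricalNaiveBayes().fit(rows, labels)
print(model.predict(("friend", "traceable")))
print(model.predict(("stranger", "unknown")))
```

Working in log space avoids numerical underflow when many features are multiplied, and the smoothing term keeps a feature value that was never seen with a class from forcing that class's probability to zero.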


Further, analyzing the publications covered in this literature survey yields the following main
findings:

— The content-based filtering (CBF) approach remains the most widely used; in it, the items
recommended are similar to what the user already knows (using similarity). As emphasized in [58],
collaborative filtering is combined with content-based systems to improve the recommendation results
overall. Because the number of items is high while the number of users is low, and very few users rate
the same items, CBF is used to cover the cold-start and sparsity problems, although CBF also has its limits.
For this reason, it would be even more effective to combine CBF based on content similarity
with the trust degree calculated through social relations, that is, the social network.

— The features of the recommendations are in most cases represented with the vector space model and
weighted with TF-IDF, and in other cases with latent semantic analysis (LSA) with the help of the
latent Dirichlet allocation (LDA) distribution of topics.

— A progressive step in RS and in its relation to other fields is the adoption of
machine learning algorithms [59], mainly artificial neural networks (ANN) and
Bayesian classifiers, as well as probabilistic techniques such as Bayesian systems, belief models
(Dempster-Shafer theory, subjective logic), and Markov models; the majority are based on
matrix factorization techniques [60]. For example, instead of the classic bag-of-words method, which
does not consider semantic similarities between items, a recurrent neural network (RNN) is used.
Additionally, work using MapReduce, in which users write programs that process data in parallel,
shows that the big data field is suitable for RS studies.

— Regarding the data sets used for evaluation, the most common remains CiteSeerX, a powerful source for
data mining, machine learning, and information retrieval that uses enriched metadata, followed by the IEEE
library, Web of Science, ScienceDirect, Scopus, ResearchGate, Academia, Google Scholar, DBLP, and
Elsevier.

— Despite the criticism, offline evaluation is dominant as a method of evaluation in the community of 
recommendations. The reason that offline evaluation may be more favorable is that the results are more 
quickly available compared to online evaluation and user study. 
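
The vector space representation with TF-IDF weighting noted in the findings above, together with the suggested combination of content similarity and a trust degree, can be sketched as follows. The tokenized documents, the `blended_score` function, and the weight `alpha` are hypothetical; the linear blend is only one simple way to combine the two signals and is not a method prescribed by the reviewed papers.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def blended_score(similarity, trust, alpha=0.5):
    """Hypothetical linear blend of content similarity and a trust degree in [0, 1]."""
    return alpha * similarity + (1.0 - alpha) * trust


# Toy corpus: one item the user liked and two candidate items.
liked = "trust aware recommender systems".split()
candidate_a = "trust models for recommender systems".split()
candidate_b = "tourism points of interest ranking".split()

vectors = tf_idf_vectors([liked, candidate_a, candidate_b])
sim_a = cosine(vectors[0], vectors[1])
sim_b = cosine(vectors[0], vectors[2])

# Assumed trust degrees of the sources recommending each candidate.
print(blended_score(sim_a, trust=0.9))
print(blended_score(sim_b, trust=0.2))
```

Setting `alpha` closer to 1 makes the ranking depend mostly on content similarity, while values closer to 0 favor items recommended by highly trusted nodes in the social network.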


4. CONCLUSION 

In this paper, we have presented a literature survey of the approaches to, and the influence of, SNA and data
provenance on RS, by analyzing the different approaches, evaluation data sets, and evaluation methods used
in this field and comparing them with other studies presented in a significant number of
publications. The approaches and data sets are introduced along dimensions that allow them to be compared
with each other and that help determine which of them suits a given environment. According to
current trends, RS approaches are expected to build on the most advanced machine learning
techniques. Some of the approaches are expected to expand in the scope of recommendation,
for example to movies, tourism, calls for publications, study grants, and so on. As explained in this paper,
trustworthiness in recommender systems remains a major concern. For future work, we intend to use
a machine learning approach, with the naive Bayes classifier as one of the most powerful classification
methods, to improve the accuracy of recommendations and increase users' confidence in recommendations in the
network.